Using Error-Correcting Codes with Co-Training for text classification with a large number of categories

نویسنده

  • Rayid Ghani
چکیده

A major concern with supervised learning techniques for text classification is that they often require a large number of labeled examples to learn accurately. One way to reduce the amount of labeled data required is to develop algorithms that can learn effectively from a small number of labeled examples augmented with a large number of unlabeled examples. In this paper, we develop a framework to incorporate unlabeled data in the Error-Correcting Output Coding (ECOC) setup by decomposing multiclass problems into multiple binary problems and then use Co-Training to learn the individual binary classification problems. We show that our method is especially useful for classification tasks involving a large number of categories where Co-training doesn’t perform very well by itself and when combined with ECOC, outperforms several other algorithms that combine labeled and unlabeled data for text classification. We also find that our combination method outperforms EM in terms of accuracy and is also able to provide high-precision results.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

KDD Project Report Using Error-Correcting Codes for Efficient Text Classification with a Large Number of Categories

We investigate the use of Error-Correcting Output Codes (ECOC) for efficient text classification with a large number of categories and propose several extensions which improve the performance of ECOC. ECOC has been shown to perform well for classification tasks, including text classification, but it still remains an under-explored area in ensemble learning algorithms. We explore the use of erro...

متن کامل

Combining Labeled and Unlabeled Data for Text Classification with a Large Number of Categories

We develop a framework to incorporate unlabeled data in the Error-Correcting Output Coding (ECOC) setup by decomposing multiclass problems into multiple binary problems and then use Co-Training to learn the individual binary classification problems. We show that our method is especially useful for classification tasks involving a large number of categories where Co-training doesn’t perform very...

متن کامل

Combining Labeled and Unlabeled Data for MultiClass Text Categorization

Supervised learning techniques for text classi cation often require a large number of labeled examples to learn accurately. One way to reduce the amount of labeled data required is to develop algorithms that can learn e ectively from a small number of labeled examples augmented with a large number of unlabeled examples. Current text learning techniques for combining labeled and unlabeled, such ...

متن کامل

An Approach to Increasing Reliability Using Syndrome Extension

Computational errors in numerical data processing may be detected efficiently by using parity values associated with real number codes, even when inherent round off errors are allowed in addition to failure disruptions. This paper examines correcting turbo codes by straightforward application of an algorithm derived for finite-field codes, modified to operate over any field. There are syndromes...

متن کامل

Using Error-Correcting Codes for Text Classification

This paper explores in detail the use of Error Correcting Output Coding (ECOC) for learning text classifiers. We show that the accuracy of a Naive Bayes Classifier over text classification tasks can be significantly improved by taking advantage of the error-correcting properties of the code. We also explore the use of different kinds of codes, namely Error-Correcting Codes, Random Codes, and Do...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007